1 Introduction

The online newspaper sector is among the most exposed to drastic change, especially with the emergence of new online services and strategies. The challenges are growing, not only in terms of technology and distribution channels, but also because of new competitors: social networks. Physical sales are decreasing, and new online and social strategies are needed to attract an online audience and increase ad revenue.
To perform well in terms of profit and engagement, a newspaper now has to be active on many platforms. We decided to analyse the South China Morning Post (SCMP) because of its particular position: it combines a printed newspaper with active Facebook, Instagram and Twitter accounts. Even though the newspaper sector is facing many challenges, this newspaper is one of the most competitive in the world.
The South China Morning Post was established in Hong Kong in 1903 by Tse Tsan-Tai and Alfred Cunningham and published its first English-language edition on November 6th 1903. 115 years later it is one of the most important news portals in Hong Kong; it was acquired by Alibaba in 2015. It set out to be a global media company that reports mainly about Hong Kong and mainland China to an international audience.
SCMP produces content that covers multiple topics such as business, technology, and lifestyle. To achieve its mission of becoming the bridge of communication between Asia and the world, SCMP went through a series of transformations in 2018 that have allowed it to reach a broader audience.
In this project we analyzed Twitter data about SCMP in order to understand its social media network. Our goal was to understand how Twitter users behave around SCMP’s account and then to find opportunities for SCMP to improve its social media strategy. This project was developed as graded coursework for the Social Media course of the Business Analytics Master programme at Hong Kong University.

2 Data Extraction and Transformation

To extract the data from Twitter, we used the rtweet package with the search term “SCMPNews”. The data was extracted three times between December 2018 and January 2019:
- First extraction: 18 December 2018
- Second extraction: 25 December 2018
- Third extraction: 1 January 2019

The resulting data was merged, and duplicated tweets were excluded. From the merged data, the team then built three component files that were used in the further analysis:
- Vertex data: File with information about each user account, such as number of friends, number of tweets, account language and others.
- Edge data: File with the relationships between users through tweets, such as retweet, reply, mention and others.
- Tweet data: The original file extracted using the Twitter API.
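
The merge and split described above can be sketched as follows. This is a minimal illustration on toy data, not the project's actual pipeline: the three small data frames stand in for the three extractions, and the column names `status_id`, `screen_name` and `reply_to_screen_name` are assumed to follow the naming used in rtweet output. Only reply edges are built here; retweets and mentions would be handled the same way.

```r
# Toy stand-ins for the three weekly extractions (hypothetical data).
ext1 <- data.frame(status_id = c("1", "2"),
                   screen_name = c("alice", "bob"),
                   reply_to_screen_name = c(NA, "SCMPNews"),
                   stringsAsFactors = FALSE)
ext2 <- data.frame(status_id = c("2", "3"),
                   screen_name = c("bob", "carol"),
                   reply_to_screen_name = c("SCMPNews", "alice"),
                   stringsAsFactors = FALSE)
ext3 <- data.frame(status_id = c("3", "4"),
                   screen_name = c("carol", "dave"),
                   reply_to_screen_name = c("alice", NA),
                   stringsAsFactors = FALSE)

# Merge the extractions and keep one row per tweet id.
tweets <- rbind(ext1, ext2, ext3)
tweets <- tweets[!duplicated(tweets$status_id), ]

# Edge data: one row per reply relationship (source -> target).
replies <- tweets[!is.na(tweets$reply_to_screen_name), ]
edges <- data.frame(source = replies$screen_name,
                    target = replies$reply_to_screen_name,
                    type = "reply",
                    stringsAsFactors = FALSE)

# Vertex data: every account that wrote a tweet or appears in an edge.
vertices <- data.frame(
  user = sort(unique(c(tweets$screen_name, edges$source, edges$target))),
  stringsAsFactors = FALSE)
```

Deduplicating on the tweet id rather than on the text keeps distinct tweets that happen to share the same wording.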

3 Tools

For our analysis, we could choose among several software tools:
- R
- NodeXL
- Gephi
- Python

NodeXL had the capability of handling most of the social network analysis we wanted to do. However, due to its limitations as an Excel plug-in in handling large amounts of data, we focused on using R for data extraction, time-series, topic modeling, natural language processing and cluster analysis. Python was used for data wrangling, and sentiment analysis under natural language processing. Gephi was used to perform social network analysis based on the extracted data.

Our final data contains 25,372 unique tweets, 13,267 users and 50,157 relationships (only tweets in English were considered for this analysis). On average there are 1,103 tweets per day in the data, but we can clearly see that on the 18th of December some event made the number of tweets increase dramatically. After some analysis we found that the trending topic on that day was the “USA and China Trade war”. More details about this are in the Topic Cluster and User Cluster articles.

#packages used below; the tweet data frame is assumed to be already loaded
library(dplyr)
library(plotly)

#preparing the data for the visualization: number of tweets per day
time <- tweet %>%
      mutate(day = as.Date(cut(created_at, breaks = "day"))) %>%
      group_by(day) %>%
      summarise(total = n())

#full calendar of days, so that days without tweets still appear on the axis
time_aux <- tibble::tibble(
      time = seq(as.Date("2018-12-10"), as.Date("2019-01-02"), by = "day"))
time <- left_join(time_aux, time, by = c("time" = "day"))

#using plotly package for a dynamic visualization
plot_ly(time, x = ~time, y = ~total, mode = 'lines', line = list(color = 'rgb(0, 0, 102)', width = 3)) %>%
      add_lines() %>%
      layout(title = "Tweets per day") %>%
      rangeslider(time_aux$time[1], time_aux$time[5])

4 Overview

In this context, we know that SCMP is the primary source of content, so it is reasonable to expect that its social network is built around its profile. What the graph shows, however, is that this account is so dominant that it is hard to identify relevant sub-networks in the structure.

When we compare the betweenness centrality of the top users, we see that the difference between SCMP (top 1) and the others is huge. Note that this metric is normalized to make the comparison easier.

User*    Betweenness Centrality
Top 1    0.99
Top 2    0.01
Top 3    0.01
Top 4    0.01
Top 5    0.01

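
To illustrate why a single hub ends up with almost all of the betweenness, here is a minimal base R sketch of normalized betweenness centrality for a small unweighted, undirected graph, using Brandes' algorithm. It is not the computation used for the table above (that came from the network analysis tools); the function name `betweenness_normalized` and the toy accounts `u1` to `u5` are hypothetical.

```r
# Normalized betweenness centrality for an unweighted, undirected graph.
# `adj` is a named list mapping each node to its neighbours.
betweenness_normalized <- function(adj) {
  nodes <- names(adj)
  n <- length(nodes)
  cb <- setNames(numeric(n), nodes)
  for (s in nodes) {
    # Breadth-first search from s, counting shortest paths (sigma).
    sigma <- setNames(numeric(n), nodes); sigma[s] <- 1
    d <- setNames(rep(-1, n), nodes); d[s] <- 0
    preds <- setNames(vector("list", n), nodes)
    visited <- character(0)
    queue <- s
    while (length(queue) > 0) {
      v <- queue[1]; queue <- queue[-1]
      visited <- c(visited, v)
      for (w in adj[[v]]) {
        if (d[w] < 0) {              # first time w is reached
          d[w] <- d[v] + 1
          queue <- c(queue, w)
        }
        if (d[w] == d[v] + 1) {      # v lies on a shortest path to w
          sigma[w] <- sigma[w] + sigma[v]
          preds[[w]] <- c(preds[[w]], v)
        }
      }
    }
    # Accumulate dependencies, farthest nodes first.
    delta <- setNames(numeric(n), nodes)
    for (w in rev(visited)) {
      for (v in preds[[w]]) {
        delta[v] <- delta[v] + sigma[v] / sigma[w] * (1 + delta[w])
      }
      if (w != s) cb[w] <- cb[w] + delta[w]
    }
  }
  # Each undirected pair was counted twice; then divide by the number of
  # node pairs that exclude the node itself: (n - 1)(n - 2) / 2.
  (cb / 2) / ((n - 1) * (n - 2) / 2)
}

# Toy network: a hub connected to five accounts, two of which also
# talk to each other directly.
adj <- list(
  SCMPNews = c("u1", "u2", "u3", "u4", "u5"),
  u1 = c("SCMPNews", "u2"),
  u2 = c("SCMPNews", "u1"),
  u3 = "SCMPNews",
  u4 = "SCMPNews",
  u5 = "SCMPNews"
)
bc <- betweenness_normalized(adj)
```

In this toy network the hub scores 0.9 and every other account scores 0: nine of the ten pairs of non-hub accounts can only reach each other through the hub. The same effect, at a much larger scale, is what makes the SCMP account dominate the table above.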
The SCMP profile is “in the middle” of the majority of conversations (i.e. edges), and that is why it is so hard to find sub-communities in this network: most of the relationships involve SCMP directly. After running the modularity clustering available in Gephi, we were able to find 21 different communities. The most important ones are highlighted in dark blue, yellow, light blue, red, black and green, and all the other (small) ones are in gray. As one can notice, compared to the giant component in dark blue, which contains the SCMP node, none of the other sub-networks has high relevance.

5 The Alternative Approach

To come up with a more insightful analysis, we decided to use an alternative approach, combining topic clustering, user clustering and sentiment analysis to split this giant network into more meaningful sub-networks. To understand these analyses in more depth, see the accompanying articles.

5.1 Topic Modelling

As a huge media company, SCMP is always producing content on a variety of subjects. As we can see on its website, it even classifies this content as:

#image

With this said, we decided to identify the natural topics that were discussed in the SCMP network. Notice that we are considering not only SCMP posts but also all the other users’ tweets. So, the analysis is broader than the content that SCMP produces alone, since it considers the engagement with the topics spread through its network. To achieve this goal, we used Natural Language Processing algorithms to assess these topics and find groups of common content. We decided to use LDA (Latent … Algorithm) for this analysis, and we found 6 different groups in our data:
- “USA and China”: News about the relationship between the USA and China, with special attention to the trade war, which was widely discussed in this period; that is why these words are bigger than the others.
- “International”: News about the international scene. In this part of the cloud we can see the names of some countries, like “Canada”, “Japan” and “India”. The main story in this topic is the arrest in Canada of the daughter of Huawei’s founder.
- “Hong Kong News”: Many different stories, all of them happening in or related to Hong Kong.
- “Mainland China”: News from inside mainland China. One of the popular stories in this topic is about the group of Christians who were arrested during this period.
- “Sports”: The story that generated the most engagement was about rumours of a boycott of the Olympic Games in Japan because of the new Japanese policy on whale hunting.
- “Business”: General business news. In this cloud we can see terms like “Jack Ma”, “CEOs”, “wrapping” and “Christmas”.
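
As a rough illustration of how topics like these can be obtained, the sketch below fits an LDA topic model to a handful of toy documents. It rests on assumptions: the `topicmodels` package is used here only because it is a common LDA implementation in R (the function actually used in the project is not named in this article), the documents are invented, and two topics are used instead of six to keep the example small.

```r
library(tm)           # document-term matrix helpers
library(topicmodels)  # assumed LDA implementation, not confirmed by the source

# Toy documents standing in for cleaned tweet text (hypothetical data).
docs <- c("trade war tariffs china usa",
          "usa china trade talks tariffs",
          "china usa tariffs trade deal",
          "olympic games japan whale hunting",
          "japan whale hunting boycott olympic",
          "olympic boycott japan games whale")

# One row per document, one column per term.
dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))

# Fit a two-topic model; the seed makes the fit reproducible.
fit <- LDA(dtm, k = 2, control = list(seed = 2019))

top_terms  <- terms(fit, 3)   # three most probable words per topic
doc_topics <- topics(fit)     # most likely topic for each document
```

The most likely topic per document (`doc_topics`) plays the role of the `topic` column used in the word cloud code in this section.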

Interestingly, we did not find all the content categories that SCMP uses on its website, which means that not all of them produce enough engagement. They probably reach a very specific kind of user who, compared to the whole network, does not play a very important role.

#packages used below: dplyr for data handling, tm for the term-document matrix, wordcloud for the plot
library(dplyr)
library(tm)
library(wordcloud)

#selecting the important variables. The topic variable was obtained using the *** function. More details about this modeling part are found in Topic Modelling article. 
aux <- tweet %>% select(status_id, text, created_at, topic)

#cleaning the tweet text: remove URLs (only the URL itself, not the text after it) and non-ASCII characters
aux$stripped_text <- gsub("http\\S+", "", aux$text)
aux$stripped_text <- gsub('[^\x20-\x7E]', '', aux$stripped_text)

#one long document per topic (1 to 6), built by pasting together all tweets of that topic
topic_docs <- sapply(1:6, function(k) {
      topic_k <- aux %>% filter(topic == k)
      paste(topic_k$stripped_text, collapse = " ")
})
topic_docs <- removeWords(topic_docs, stopwords("english"))

#generating tdm matrix for the visualization
corpus <- Corpus(VectorSource(topic_docs))
tdm <- TermDocumentMatrix(corpus)
tdm2 <- removeSparseTerms(tdm, 0.7)
tdm2 <- as.matrix(tdm2)

# add column names (one per topic, in topic order 1 to 6)
colnames(tdm2) <- c("HK News", "International", "USA and China", "Business", "Sports", "Mainland China")

comparison.cloud(tdm2, random.order = FALSE, scale = c(4, 0.4),
      colors = c("midnightblue", "darkgoldenrod1", "goldenrod4", "dodgerblue3", "gray44", "darkorange3"),
      title.size = 0.8,
      max.words = 5000)

5.2 User Cluster

We also clustered the users to understand the different kinds of users in our data and how they interact with the different topics. For this task we used the k-means algorithm with 9 groups. The variables used in this model are: - - - -
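
A minimal sketch of this kind of user clustering is shown below, using the `kmeans()` function from base R. The two features (`followers` and `tweets`) and the simulated values are hypothetical, chosen only for illustration, and the toy data is split into three groups rather than the nine used in the project.

```r
set.seed(42)  # k-means starts from random centres

# Hypothetical user features, simulated for three kinds of account.
users <- data.frame(
  followers = c(rnorm(20, 100, 20), rnorm(20, 5000, 500), rnorm(20, 100000, 10000)),
  tweets    = c(rnorm(20, 50, 10),  rnorm(20, 2000, 300), rnorm(20, 20000, 2000))
)

# Standardise so that no single variable dominates the distance.
features <- scale(users)

# Three groups for this toy data; nstart repeats the random start
# several times and keeps the best solution.
fit <- kmeans(features, centers = 3, nstart = 25)

cluster_sizes <- table(fit$cluster)  # number of users assigned to each group
```

Scaling matters here because k-means works on Euclidean distances: without it, a variable measured in the tens of thousands would drown out one measured in the tens.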

5.3 Sentiment Analysis

Another interesting topic in social media is sentiment analysis. Especially in the context of news, this analysis helps us understand how users react to each topic. It can also help us understand how opinion spreads along the relationships through the network. But even more interesting is the agreement analysis….. As we can see from the cloud below, some words like “american”, “propaganda”, “trade”, “war” and “China” are very frequent in posts classified as negative, while words like “Jack Ma”, “founder”, “CEOs” and “bravo” are frequent in positive posts.
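
To make the idea of classifying posts as positive or negative concrete, here is a deliberately simple lexicon-based sketch in base R. It is not the classifier used in the project, whose method is not described in this article; the word lists, the example tweets and the function name `score_sentiment` are all hypothetical.

```r
# Tiny illustrative lexicon (hypothetical); real lexicons hold thousands of words.
positive <- c("bravo", "great", "success")
negative <- c("propaganda", "war", "crisis")

# Score = number of positive words minus number of negative words.
score_sentiment <- function(text) {
  # Lower-case, replace anything that is not a letter or space, split into words.
  words <- strsplit(gsub("[^a-z ]", " ", tolower(text)), " +")[[1]]
  sum(words %in% positive) - sum(words %in% negative)
}

tweets <- c("Bravo to the founder, a great success!",
            "More propaganda about the trade war.",
            "Markets open on Monday.")

scores <- vapply(tweets, score_sentiment, integer(1), USE.NAMES = FALSE)
labels <- ifelse(scores > 0, "positive",
                 ifelse(scores < 0, "negative", "neutral"))
```

A word-counting approach like this ignores negation and sarcasm, so it is best read as a baseline rather than as a replacement for a trained sentiment model.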

some agreement visualization

6 Social Network Graphs

After modeling the topics, user clusters and sentiment, we were able to put all those pieces together and split the data into sub-networks, as shown in the following image:

7 Conclusion

8 Recommendations


9 Resources

Twitter Icon made by Freepik from www.flaticon.com is licensed by CC 3.0 BY